


Database, 2014, 1-9 
doi: 10.1093/database/bau032 
Original article 




FixPred: a resource for correction of erroneous 
protein sequences 

Alinda Nagy and Laszlo Patthy*

Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, 
H-1113 Budapest, Hungary 

*Corresponding author: Tel: +361 279 3100; Fax: +361 466 5465; Email: patthy.laszlo@ttk.mta.hu

Citation details: Nagy,A. and Patthy,L. FixPred: a resource for correction of erroneous protein sequences. Database
(2014) Vol. 2014: article ID bau032; doi:10.1093/database/bau032 

Received 9 January 2014; Revised 27 February 2014; Accepted 15 March 2014 

Abstract 

Protein databases are heavily contaminated with erroneous (mispredicted, abnormal 
and incomplete) sequences and these erroneous data significantly distort the conclu- 
sions drawn from genome-scale protein sequence analyses. In our earlier work we 
described the MisPred resource that serves to identify erroneous sequences; here we 
present the FixPred computational pipeline that automatically corrects sequences identi- 
fied by MisPred as erroneous. The current version of the associated FixPred database 
contains corrected UniProtKB/Swiss-Prot and NCBI/RefSeq sequences from Homo 
sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus,
Xenopus tropicalis, Danio rerio, Fugu rubripes, Ciona intestinalis, Branchiostoma floridae,
Drosophila melanogaster and Caenorhabditis elegans; future releases of the FixPred 
database will include corrected sequences of additional Metazoan species. The FixPred 
computational pipeline and database (http://www.fixpred.com) are easily accessible 
through a simple web interface coupled to a powerful query engine and a standard web 
service. The content is completely or partially downloadable in a variety of formats. 
Database URL: http://www.fixpred.com 



Introduction 

Medical sciences, drug development, agriculture and bio- 
technology rely increasingly on information originating 
from genome projects. One of the most crucial steps in the 
interpretation of genome sequences is the computational 
identification of protein-coding genes and prediction of 
their structure; the success of all subsequent steps of protein
research exploiting genomic sequences depends on the
quality of these predictions.



Despite significant improvements in gene-prediction 
technologies, prediction of the structure of protein-coding 
genes of higher eukaryotes remains a difficult task; accord- 
ing to current estimates, the structure of only ~60% of pre- 
dicted human genes is correct (1,2). Because erroneous data 
generated by misprediction are carried forward en masse to 
other databases and biological conclusions are drawn from 
the erroneous data, this may significantly distort the results 
of genome-scale protein sequence analyses (3-9). 



© The Author(s) 2014. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits
unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 

To solve this problem, in our earlier MisPred project
we developed a method that helps decide whether the
structure of an in silico predicted or experimentally sup- 
ported protein-coding gene is erroneous (mispredicted, 
abnormal or incomplete). The MisPred approach is based 
on the principle that the structure of a protein-coding gene 
is likely to be erroneous if some of the features of the pro- 
tein-coding gene or the predicted protein conflict with 
some of the dogmas about protein-coding genes and pro- 
teins (3, 10). The current version of the MisPred computa- 
tional pipeline uses 11 distinct tools to identify erroneous 
sequences affected by different types of errors (10). 

Tool 1. Conflict between the presence of obligatory extracellular
protein domain(s) (http://www.mispred.com/table1to3) in a protein
and the absence of appropriate sequence signals that could direct
the extracellular domain(s) into the extracellular space.

Tool 2. Conflict between the presence of obligatory extracellular
and cytoplasmic domains (http://www.mispred.com/table1to3) in a
protein and the absence of transmembrane helix(ces).

Tool 3. Co-occurrence of obligatory extracellular and nuclear
domains (http://www.mispred.com/table1to3) in a protein.

Tool 4. Domain size deviation.

Tool 5. Inter-chromosomal chimeric proteins.

Tool 6. Conflict between the presence of secretory signal peptide
and obligatory cytoplasmic protein domains
(http://www.mispred.com/table1to3) in a protein and the absence of
transmembrane segments.

Tool 7. Conflict between the presence of GPI-anchor in a protein
and the absence of secretory signal peptide.

Tool 8. Co-occurrence of GPI-anchor and obligatory cytoplasmic
protein domains (http://www.mispred.com/table1to3) in a protein.

Tool 9. Co-occurrence of GPI-anchor and obligatory nuclear protein
domains (http://www.mispred.com/table1to3) in a protein.

Tool 10. Co-occurrence of GPI-anchor and transmembrane segments
in a protein.

Tool 11. Domain architecture deviation.
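
The rules behind Tools 1-3 and 6-10 are Boolean conflicts between annotated features, so they can be illustrated with a minimal Python sketch. The `Features` class and the function `violated_tools` are hypothetical names introduced here for illustration; they are not part of the MisPred software, which derives these features from dedicated predictors. Tools 4, 5 and 11 need domain-length, chromosomal and domain-architecture data and are therefore left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Features:
    """Hypothetical feature flags for one protein (illustrative only)."""
    extracellular_domain: bool = False
    cytoplasmic_domain: bool = False
    nuclear_domain: bool = False
    signal_peptide: bool = False
    signal_anchor: bool = False
    transmembrane: bool = False
    gpi_anchor: bool = False


def violated_tools(f: Features) -> list[int]:
    """Return the numbers of the tools whose conflict rule fires."""
    rules = {
        # Tool 1: extracellular domain without a signal to export it.
        1: f.extracellular_domain and not (f.signal_peptide or f.signal_anchor),
        # Tool 2: extracellular and cytoplasmic domains without a TM helix.
        2: f.extracellular_domain and f.cytoplasmic_domain and not f.transmembrane,
        # Tool 3: extracellular and nuclear domains in the same protein.
        3: f.extracellular_domain and f.nuclear_domain,
        # Tool 6: signal peptide and cytoplasmic domain without a TM segment.
        6: f.signal_peptide and f.cytoplasmic_domain and not f.transmembrane,
        # Tool 7: GPI-anchor without a secretory signal peptide.
        7: f.gpi_anchor and not f.signal_peptide,
        # Tools 8-10: GPI-anchor together with an incompatible feature.
        8: f.gpi_anchor and f.cytoplasmic_domain,
        9: f.gpi_anchor and f.nuclear_domain,
        10: f.gpi_anchor and f.transmembrane,
    }
    return [tool for tool, fired in rules.items() if fired]
```

Under these assumptions, a protein annotated only with an extracellular domain is flagged by Tool 1, whereas adding a signal peptide removes the conflict.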

By identifying mispredicted protein sequences, the 
MisPred pipeline serves to inform the creators of the pre- 
dictive algorithms of the reliability of predictions, thereby 
assisting the improvement of gene prediction technologies. 
The MisPred approach is also useful for the identification 
of abnormal and incomplete proteins: with the help of the 
MisPred tools we could show that a significant proportion 
of alternatively spliced mRNAs do not encode viable proteins
(11) and that many allegedly full-length cDNAs are
in fact incomplete (3).



Elimination of erroneous entries from public databases 
is of crucial importance because it might protect users 
from drawing erroneous conclusions based on erroneous 
data, but it is even more important to correct erroneous se- 
quences. Given the flood of erroneous data from genome 
projects, there is an increasing need for computational 
tools that perform these corrections on a mass scale. 

The main objective of our FixPred project is to develop 
the FixPred pipeline for the automatic correction of 
sequences identified by MisPred as erroneous and to con- 
struct the FixPred database in which corrected versions of 
erroneous sequences are deposited. 

The FixPred pipeline 

We have designed the FixPred pipeline to correct abnor- 
mal, incomplete and mispredicted proteins primarily from 
Metazoan species. The rationale of the FixPred approach is 
that an erroneous sequence (identified as such by MisPred) 
is judged to be corrected if the correction eliminates the 
error(s) identified by MisPred. 

Note that MisPred does not only state that a sequence is 
erroneous but also identifies the 'type of error', thereby 
pinpointing the 'location of the error' (Supplementary 
Figure S1). For example, if a protein is identified as erroneous
by MisPred tool 1 (i.e. the protein contains domains
that occur exclusively in the extracellular space but lacks a 
secretory signal peptide or signal anchor sequence that 
could direct the domain into the extracellular space), then 
we know that the error affects the N-terminal part of the 
sequence and this error may be corrected by identifying the 
missing secretory signal peptide or signal anchor sequence 
(Supplementary Figure S1A). Similarly, if a protein is identified
as erroneous by MisPred tool 2 (i.e. it contains both
extracellular and cytoplasmic protein domains but lacks 
transmembrane helices that pass through the membrane), 
then we know that the error is located internally, between 
the extracellular and cytoplasmic domains of the protein, 
and this error may be corrected by identifying the missing 
transmembrane helix (Supplementary Figure S1B).

There are multiple ways to correct an erroneous protein 
sequence deposited in a database: (i) the correct sequence 
may already exist in other protein databases; (ii) protein, 
cDNA and EST databases may contain sufficient
information to assemble a corrected version of the erroneous
sequence; (iii) a corrected version of the protein may
be predicted by subjecting the genome sequence to compu- 
tational gene predictions. 

Figure 1. Flow chart of the FixPred pipeline. Boxes from top to
bottom, linked by 'No' branches:
  Erroneous sequence identified by MisPred
  Correct version of the sequence is found in other protein databases?
  Correct protein sequence can be reconstructed from protein fragments?
  Correct protein sequence can be reconstructed using ESTs and cDNAs?
  Correct protein sequence can be predicted by homology-based gene prediction?
  Correct protein sequence can be predicted by de novo gene prediction?
Output box: FixPred Database.

The FixPred pipeline attempts to correct erroneous
sequences in several steps, starting with the simplest solutions
(finding experimental evidence for the correct sequence
version in existing protein or cDNA and EST
databases), progressing to more time-consuming gene
predictions. The FixPred software package corrects erroneous
sequences according to a dichotomic decision tree (see
flow chart in Figure 1):

Step 1. MisPred identifies a sequence as erroneous. The se- 
quence is used as input of the FixPred pipeline. Because the 
false-positive rates of some MisPred tools are relatively 
high (3, 4, 10), users are advised to subject sequences iden- 
tified by MisPred as suspicious to additional analyses to de- 
cide whether the protein is truly erroneous or a false 
positive before they subject it to sequence correction by the 
FixPred pipeline. 

Step 2. Search for a correct version of the erroneous se- 
quence in other protein databases. If these searches find a 
correct version (that is not affected by the error detected by 
MisPred), then the corrected version is deposited in the 
FixPred database and the correction procedure is termi- 
nated. If these searches fail to find a correct version, then 
the erroneous sequence is used as input in Step 3.

Step 3. Reconstruction of a corrected protein sequence
using overlapping protein fragments. If sequence searches 
in Step 2 identified fragments that overlap with the errone- 
ous sequence but differ from it in the region affected by the 
error, then FixPred uses the overlapping fragments to re- 
construct sequences (Supplementary Figure S2). If these re- 
constructions correct the error identified by MisPred, then 
the corrected sequence is deposited in the FixPred database 
and the correction procedure is terminated. If this step fails 
to reconstruct a corrected version, then the erroneous se- 
quence is used as input in Step 4. 



Step 4. Reconstruction of a corrected protein sequence 
using overlapping ESTs or cDNAs. ESTs/cDNAs that over- 
lap with the erroneous sequence but differ from it in the 
region affected by the error are used to reconstruct se- 
quences. If these reconstructions correct the error identified 
by MisPred, then the corrected sequence is deposited in the 
FixPred database and the correction procedure is terminated.
If this step fails to reconstruct a correct version, then
the erroneous sequence is used as input in Step 5.

Step 5. Homology-based prediction of a corrected version
of the erroneous sequence using genomic sequence. The 
erroneous sequence is used to search for non-erroneous 
homologs from the same species (paralogs) and from other 
species (orthologs and paralogs). The genomic region that 
encodes the erroneous sequence is subjected to homology- 
based gene prediction, using the closest non-erroneous 
homologs. If the predictions include sequences (or se- 
quence fragments) that are not affected by the original 
error then these are used to correct the erroneous sequence. 
As predictions that correct the original error may introduce 
errors elsewhere, only the corrected region is used in the 
reconstruction of the corrected version. If these reconstruc- 
tions correct the error identified by MisPred, then the cor- 
rected sequence is deposited in the FixPred database and 
the correction procedure is terminated. If this step fails 
to reconstruct a corrected version, then the erroneous 
sequence is used as input in Step 6. 

Step 6. De novo prediction of a corrected version of the 
erroneous sequence using genomic sequence. The genomic 
region that encodes the erroneous sequence is analyzed 



Database, Vol. 2014, Article ID bau032 



Page 4 of 9 



with tools of de novo gene prediction. If the predictions in- 
clude sequences (or sequence fragments) that are not af- 
fected by the original error, then these are used to correct 
the erroneous sequence. As predictions that correct the ori- 
ginal error may introduce errors elsewhere, only the cor- 
rected region is used in the reconstruction of the correct 
version. If these reconstructions correct the error identified 
by MisPred, then the corrected sequence is deposited in the 
FixPred database and the correction procedure is 
terminated. 
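
The control flow of Steps 2-6 can be summarized as a cascade in which the same erroneous sequence is passed to each step in turn until one of them proposes a candidate that no longer shows the MisPred error. The following Python sketch illustrates only this control flow; `run_cascade`, the `steps` list and the `has_error` callback are hypothetical names, and the real steps (database searches, fragment assembly, gene prediction) are represented by placeholder functions.

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple

# A step takes the erroneous sequence and proposes candidate corrections.
Proposer = Callable[[str], Iterable[str]]


def run_cascade(
    erroneous: str,
    steps: Sequence[Tuple[str, Proposer]],
    has_error: Callable[[str], bool],
) -> Optional[Tuple[str, str]]:
    """Try each step in order; stop at the first candidate that passes.

    `has_error` stands in for re-running the MisPred tool that flagged
    the original sequence. Returns (step name, corrected sequence), or
    None if no step yields an error-free candidate.
    """
    for name, propose in steps:
        for candidate in propose(erroneous):
            if not has_error(candidate):
                return name, candidate
    return None


# Toy usage: the "error" is a missing initiator methionine.
toy_steps = [
    ("Step 2", lambda s: []),                 # nothing found in databases
    ("Step 3", lambda s: ["Q" + s, "M" + s]),  # second candidate is fixed
    ("Step 4", lambda s: ["MM" + s]),          # never reached
]
result = run_cascade("ABC", toy_steps, lambda s: not s.startswith("M"))
```

Ordering the steps from cheapest to most expensive means that the gene-prediction steps are run only for sequences that the evidence-based steps could not resolve.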

There are several arguments in favor of the decision tree 
outlined above. Although there is no guarantee that Steps 
2, 3 and 4 succeed in finding experimental evidence for a 
correct version, this is the simplest and most straightfor- 
ward way of error correction. If these steps fail, FixPred 
proceeds to Steps 5 and 6, which are more time-consuming
but have a chance to succeed if the genome sequence is 
known. 

The FixPred pipeline exploits public databases and a 
variety of standard software: 

Steps 2 and 3. In these steps, the pipeline uses the errone- 
ous sequence as a query to search the UniProtKB/Swiss- 
Prot, UniProtKB/TrEMBL (12), EnsEMBL (13) and NCBI/ 
RefSeq (14, 15) protein databases with blastp (16), limiting
the search to the same species as the source of the query se- 
quence. FixPred selects protein sequences that are >98% 
identical (allowance for sequencing errors and polymorph- 
isms) with the query sequence over >25 residues, and these 
sequences are analyzed by the same MisPred tools as the 
ones that identified the query sequence as being suspicious. 
Sequences that are not affected by the errors that affected 
the query sequence are concluded to be the correct versions 
of the erroneous sequence. If the analysis finds only protein 
fragments that overlap with the query sequence but differ 
from it in the region affected by the error, then the errone- 
ous sequence is corrected with these overlapping frag- 
ments, eliminating the error through the assembly of the 
fragments (Supplementary Figure S2).

Step 4. In this step, key ESTs and cDNAs are identified
with the query sequence or with the closest non-erroneous 
homologs of the query sequence. 

First, the erroneous sequence is used as query to search 
EST and cDNA databases (15) with tblastn (16), limiting
the search to the species from which the erroneous se- 
quence originates. The program selects sequences that are 
>80% identical (allowance for sequencing errors and poly- 
morphisms) with the query sequence over >25 amino acid 
residues. EST or cDNA sequences thus selected are translated
in the reading frame corresponding to the query sequence
using Transeq (17). If these analyses find fragments
that overlap with the erroneous sequence but differ from it
in the region affected by the error, then the erroneous se- 
quence is corrected with these overlapping sequences, elim- 
inating the error by the assembly of the fragments 
(Supplementary Figure S3). 
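
The assembly of an overlapping fragment with the erroneous sequence can be illustrated for the simplest case, an error at the N-terminus. The Python sketch below is a simplification under stated assumptions: the function name `replace_n_terminus` is hypothetical, the overlap is required to be an exact match (whereas the pipeline tolerates a small fraction of mismatches), and the default overlap length is merely borrowed from the 25-residue criterion mentioned above.

```python
from typing import Optional


def replace_n_terminus(
    erroneous: str, fragment: str, min_overlap: int = 25
) -> Optional[str]:
    """Replace the N-terminal part of `erroneous` with `fragment`.

    The last `min_overlap` residues of the fragment serve as an anchor.
    If the anchor occurs in the erroneous sequence, everything upstream
    of it is taken from the fragment and everything downstream from the
    erroneous sequence. Returns None when no anchor is found.
    """
    if len(fragment) < min_overlap:
        return None
    anchor = fragment[-min_overlap:]
    pos = erroneous.find(anchor)  # first exact occurrence of the anchor
    if pos == -1:
        return None
    return fragment + erroneous[pos + min_overlap:]


# Toy usage with a short overlap: "XXXX" is the wrong N-terminus.
fixed = replace_n_terminus("XXXXABCDEFGHIJKLMNOP", "MSIGNALABCDEFGHIJ", 10)
```

In the pipeline the assembled sequence would then be re-analyzed with the MisPred tool that flagged the original entry; only an assembly that eliminates the error counts as a correction.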

If the search with the erroneous sequence failed to find 
ESTs/cDNAs satisfying these criteria (lack of extensive 
overlap over >25 residues), then the closest non-erroneous 
homologs are used to identify key ESTs/cDNAs that may 
originate from the region affected by the error. To find 
non-erroneous homologs, FixPred uses the erroneous se- 
quence as a query to search the UniProtKB/Swiss-Prot, 
UniProtKB/TrEMBL, EnsEMBL and NCBI/RefSeq protein 
databases using blastp, extending the search to other 
Metazoan species. Fifty homologs with the lowest E-values 
(E-value <10^^^) are analyzed by MisPred, and FixPred se- 
lects sequences that do not show signs of the same type of 
error as the query (Supplementary Figure S4). 

The closest non-erroneous homologs with the highest 
percent identity are used to identify ESTs/cDNAs that 
might be used for the correction of the erroneous sequence. 
To achieve this, the sequence region that distinguishes the 
correct homolog from the erroneous sequence, plus 30 
amino acid residues of the regions where their sequences 
overlap, is used as query to search EST and cDNA data- 
bases with tblastn, limiting the search to the species from 
which the erroneous sequence originates. FixPred selects 
homologous EST or cDNA sequences that are >50% iden- 
tical (at the amino acid level) with the non-erroneous 
homolog over >25 amino acid residues. Selected EST se- 
quences are translated in the reading frame corresponding 
to the query sequence using Transeq (17). If these analyses 
find fragments that are identical with the erroneous se- 
quence over >10 amino acid residues but differ from it in 
the region affected by the error, then the erroneous se- 
quence is corrected with these overlapping sequences, elim- 
inating the error by the assembly of the fragments 
(Supplementary Figure S4). 
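
The selection criteria used in Steps 2-4 differ only in the identity threshold, so they can be collected in a single filter. The following Python sketch assumes that search hits have already been parsed into records carrying percent identity and alignment length; the `Hit` class, the stage names and `select_hits` are hypothetical and introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one parsed BLAST hit."""
    subject_id: str
    percent_identity: float
    alignment_length: int  # in amino acid residues


# Identity thresholds (percent) quoted in the text for each search.
MIN_IDENTITY = {
    "protein_same_species": 98.0,  # Steps 2 and 3, blastp
    "est_cdna_vs_query": 80.0,     # Step 4, tblastn with the query
    "est_cdna_vs_homolog": 50.0,   # Step 4, tblastn with the homolog
}
MIN_LENGTH = 25  # hits must extend over more than 25 residues


def select_hits(hits: Iterable[Hit], stage: str) -> List[Hit]:
    """Keep hits above the identity and length thresholds of a stage."""
    threshold = MIN_IDENTITY[stage]
    return [
        h for h in hits
        if h.percent_identity > threshold and h.alignment_length > MIN_LENGTH
    ]


hits = [
    Hit("a", 99.0, 120),
    Hit("b", 98.0, 120),  # not strictly above 98%
    Hit("c", 99.5, 25),   # not longer than 25 residues
    Hit("d", 85.0, 40),
]
```

Both comparisons are strict, following the '>' signs used in the text.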

Step 5. In this step, the closest non-erroneous homologs 
(with highest per cent identity) are used to predict the cor- 
rect version of the erroneous sequence from genomic se- 
quence. First, the erroneous sequence is used as a query to 
identify the genomic region containing the gene for the 
protein using tblastn, and then the sequence of the closest 
non-erroneous homolog is used to find evidence for the 
parts that distinguish the erroneous and non-erroneous se- 
quences. If the latter search finds evidence for exons resolv- 
ing the error, then the genomic region encoding the query 
protein is extended to include regions that are expected to 
encode the correct exons. The genomic region thus selected 
is subjected to gene prediction with GeneWise (18), using
the sequence of the non-erroneous homolog as input.
Predicted protein sequences that resolve the original error 
are used to correct the erroneous sequence. Because the 
prediction that corrects the original error may introduce 
errors elsewhere, only the corrected region is used in the 
reconstruction. 

Step 6. In this step, the genomic region encoding the erroneous
sequence is analyzed with the de novo gene-finding programs
GENSCAN (19) and Augustus (20). Predicted protein
sequences that resolve the original error are used to correct 
the erroneous sequence. Because the prediction that cor- 
rects the original error may introduce errors elsewhere, 
only the corrected region is used in the reconstruction. 

The FixPred database 

The current version of the FixPred database contains 1462 
corrected UniProtKB/Swiss-Prot and NCBI/RefSeq sequences
from Homo sapiens, Mus musculus, Rattus norvegicus,
Monodelphis domestica, Gallus gallus, Xenopus
tropicalis, Danio rerio, Fugu rubripes, Ciona intestinalis,
Branchiostoma floridae, Drosophila melanogaster and
Caenorhabditis elegans; future releases of the FixPred data- 
base will include corrected sequences of additional 
Metazoan species. 

FixPred entries contain the FixPred ID, the species 
name, the corrected protein sequence (in FASTA format), 
the type of evidence on protein existence and the date of 
publication. The FixPred database contains two types of 
corrected sequences. FXP identifiers denote
sequences that were corrected automatically by the
FixPred pipeline and checked manually by an expert. The
FIXEXP identifiers denote sequences whose corrected
version was validated experimentally by cDNA cloning.

The 'Protein existence' field indicates the type of evi- 
dence that supports the existence of the protein. FixPred 
lists four types of evidence for the existence of a protein:
(i) evidence at protein level; (ii) evidence at transcript level; 
(iii) evidence at EST level; and (iv) evidence at genome 
level. Only the highest or most reliable level of supporting 
evidence for the existence of a protein is displayed for each 
entry. For example, if the existence of a protein is sup- 
ported by both ESTs and cDNAs, then the 'Protein exist- 
ence' field indicates 'Evidence at transcript level'. 
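
The rule for filling the 'Protein existence' field amounts to choosing the highest-ranking level among those available. A minimal Python sketch follows; the ranking is assumed to follow the order in which the four levels are listed above (consistent with the EST/cDNA example), and the names `EVIDENCE_ORDER` and `displayed_evidence` are hypothetical.

```python
from typing import Iterable

# Assumed ranking, most reliable first, following the order of the list.
EVIDENCE_ORDER = ["protein", "transcript", "EST", "genome"]


def displayed_evidence(levels: Iterable[str]) -> str:
    """Return the field text for the most reliable available level."""
    best = min(levels, key=EVIDENCE_ORDER.index)
    return f"Evidence at {best} level"
```

For a protein supported by both ESTs and cDNAs, `displayed_evidence(["EST", "transcript"])` gives 'Evidence at transcript level', as in the example above.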

The data sheets of the corrected protein sequences also
contain information about the original protein sequences
identified by MisPred as erroneous: the protein ID, the protein
description, the database source, the species name, the
erroneous protein sequence (in FASTA format) and the
type of sequence error(s) identified by MisPred. (If the same
erroneous sequence was deposited multiple times in the same or
different databases, several protein IDs and database sources
are listed.) A typical example of a FixPred entry is
shown in Figure 2.

Database statistics 

The 1462 corrected sequences of the current version of the 
FixPred database were generated by the FixPred pipeline 
through the analysis of 8118 erroneous UniProtKB/Swiss- 
Prot and NCBI/RefSeq entries from 12 Metazoan species. 
It must be pointed out that this ratio (0.18) of corrected 
and erroneous sequences does not equal the rate of correc- 
tion. Frequently, the same erroneous sequence is deposited 
multiple times in the same or different databases with dif- 
ferent protein IDs, and because their correction by the 
FixPred pipeline yields the same sequence, the rate of cor- 
rection may be underestimated. Another source of under- 
estimation is that several different erroneous versions of 
the same protein may be present in public databases and 
their correction may yield the same corrected sequence. 
Conversely, in the case of fusion proteins (e.g. inter- and 
intra-chromosomal chimeras), correction of a single erro- 
neous sequence is expected to yield two corrected se- 
quences through the separation of the constituent proteins. 

Despite these caveats, it is worth noting that the appar- 
ent rate of correction is lower in the case of UniProtKB/ 
Swiss-Prot entries than in the case of RefSeq sequences: a 
total of 61 corrected sequences were obtained through the 
analysis of 456 Swiss-Prot sequences (13.4%), whereas 
analysis of 7662 NCBI/RefSeq entries resulted in 1418 cor- 
rected sequences (18.5%). This observation is in harmony 
with the higher quality of the Swiss-Prot database and the 
fact that the NCBI/RefSeq database contains a relatively 
high proportion of mispredicted sequences. 
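
The apparent rates quoted in this section follow directly from the counts given in the text, as the short Python check below shows. The helper name `apparent_rate` is introduced here for illustration only.

```python
def apparent_rate(corrected: int, erroneous: int) -> float:
    """Apparent rate of correction in percent, rounded to one decimal."""
    return round(100 * corrected / erroneous, 1)


# Counts taken from the text: (corrected, erroneous) per source database.
swiss_prot = apparent_rate(61, 456)
refseq = apparent_rate(1418, 7662)

# Overall ratio of corrected to erroneous sequences.
overall_ratio = round(1462 / 8118, 2)
```

As discussed above, these figures are apparent rates only, because redundant erroneous entries and fusion proteins distort the relation between the number of erroneous and corrected sequences.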

With respect to the type of sequence error, extracellular 
proteins lacking secretory signal peptides and protein se- 
quences affected by domain size deviation constitute the 
largest proportion of erroneous sequences and the appar- 
ent rates of their correction are also among the highest 
(23.5 and 29.1%, see Table 1), making MisPred Tools 1 
and 4 the most valuable constituents of the MisPred and 
FixPred pipelines. Co-occurrence of extracellular and nu- 
clear domains in a protein is a relatively rare type of error 
(identified by MisPred tool 3) but its rate of correction is 
very high (Table 1) even if we take into account the fact 
that correction of an erroneous fusion sequence is expected 
to yield two corrected sequences (see Supplementary 
Figure S1C).

Corrected protein sequence:
  FixPred ID:                  FXP0000000419
  Species:                     Gallus gallus
  Corrected protein sequence:  [sequence illegible]
  Protein existence:           Evidence at protein level
  Date of publication:         2013-08-22 12:14:04

Protein sequence identified by MisPred as erroneous:
  Protein ID:                  FZD3_CHICK
  Protein description:         >sp|Q9PTW3|FZD3_CHICK Frizzled-3 (Fragment) OS=Gallus gallus GN=FZD3 PE=3 SV=1
  Database source:             UniProtKB/SwissProt
  Species:                     Gallus gallus
  Protein sequence:            [sequence illegible]
  Type of error identified by MisPred:  Conflict 4: Domain size deviation

Figure 2. Screen shot of an entry of the FixPred database. The figure shows the corrected version (upper part) of an erroneous protein sequence of G.
gallus, deposited in the UniProtKB/SwissProt database with the protein ID: FZD3_CHICK (lower part). The FZD3_CHICK protein was identified as erroneous
by MisPred tool 4 (domain size deviation) because it contains only a fragment of the Frizzled (PF01534) domain. The erroneous protein was corrected
by the FixPred pipeline in Step 2 by identifying a full-length version of the frizzled-3 precursor (NP_001258869.1).

Table 1. Rate of correction of different types of sequence
errors

Error type identified^a   Erroneous   Corrected   Apparent rate of
                          sequences   sequences   correction (%)
MisPred tool 1            3394        799         23.5
MisPred tool 2            10          2           20.0
MisPred tool 3            12          16          133.3^b
MisPred tool 4            2033        592         29.1
MisPred tool 5            890         36          4.0
MisPred tool 6            916         4           0.4
MisPred tool 7            479         32          6.7
MisPred tool 8            50          0           0.0
MisPred tool 9            3           0           0.0
MisPred tool 10           331         3           0.9

^a Erroneous sequences identified by MisPred tool 11 and corrected by the
FixPred pipeline are not yet deposited in the FixPred database. These data
will be released in the next update of FixPred.

^b In the case of MisPred tool 3, correction of an erroneous sequence
containing both nuclear and extracellular domains is expected to yield two
corrected sequences (Supplementary Figure S1C).

The apparent rate of correction shows significant variation
with respect to the species from which the erroneous
sequence originates. As shown in Table 2, the highest rate
of correction is observed in the case of H. sapiens
sequences, whereas the lowest rates are observed in the
case of B. floridae, C. intestinalis and D. rerio. The most
plausible explanation for these differences in rate of correction
is that they reflect differences in the availability of
experimental information on protein sequences (full length
proteins in other databases, protein fragments, cDNAs
etc.) that facilitate the correction process through Steps 2
and 3 of the FixPred pipeline. This interpretation is supported
by our observation that the highest proportion of
the corrections was completed in Steps 2 (56.2%) and 3
(37.0%), i.e. a correct version of the erroneous sequence is
present in other databases or can be reconstructed from
fragments (Table 3). Note that in the 'Sequences corrected'
column of Table 3 the sum of sequences corrected in Steps
2-6 (1532) exceeds the total number of sequences corrected
(1462). The primary source for this difference is
that the same erroneous sequence is deposited multiple
times in the same or different databases with different protein
IDs but their correction by the FixPred pipeline yields
the same sequence.

Table 2. Rate of correction of erroneous sequences of different
metazoan species

Species           Erroneous   Corrected   Apparent rate of
                  sequences   sequences   correction (%)
H. sapiens        941         331         35.2
M. musculus       455         106         23.3
R. norvegicus     704         178         25.3
M. domestica      434         93          21.4
G. gallus         458         118         25.8
X. tropicalis     547         176         32.2
D. rerio          1376        180         13.1
F. rubripes       507         97          19.1
B. floridae       1753        46          2.6
C. intestinalis   391         28          7.2
C. elegans        215         49          22.8
D. melanogaster   337         60          17.8

Table 3. Correction of erroneous proteins in different steps of
the FixPred pipeline (see Figure 1)

Steps    Sequences   Sequences   Proportion   Percent of
         analyzed    corrected   corrected    total correction
Total    8118        1462        0.18         100
Step 2   8118        822         0.10         56.2
Step 3   7296        541         0.07         37.0
Step 4   6755        75          0.01         5.1
Step 5   6680        73          0.01         5.0
Step 6   6607        21          0.00         1.4
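
The internal consistency of Table 3 can be verified with a few lines of Python: each step analyzes the sequences left uncorrected by the previous step, and the percentages are taken relative to the 1462 non-redundant corrected sequences. The variable names below are chosen for illustration.

```python
# Sequences corrected in each step, from Table 3.
corrected_per_step = {2: 822, 3: 541, 4: 75, 5: 73, 6: 21}
total_analyzed = 8118
total_corrected = 1462

# Each step receives what the previous steps failed to correct.
analyzed_per_step = {}
remaining = total_analyzed
for step, corrected in corrected_per_step.items():
    analyzed_per_step[step] = remaining
    remaining -= corrected

# Share of each step in the total number of corrected sequences (%).
percent_of_total = {
    step: round(100 * corrected / total_corrected, 1)
    for step, corrected in corrected_per_step.items()
}

# The step-wise sum exceeds the non-redundant total.
stepwise_sum = sum(corrected_per_step.values())
redundant = stepwise_sum - total_corrected
```

The difference between the step-wise sum and the total (70 sequences) corresponds to the redundancy described above.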

Database implementation 

The database is built on an Apache HTTP Server 2.2.6
with Oracle Database 11g Server. The front end was developed
using the Play! 1.2.4 (http://www.playframework.org)
framework with HTML and JavaScript, and the back
end was developed using Oracle Database 11g Server, a
relational database management system. All common gateway
interface and database interfacing scripts were written
in the Java programming language.

Web interface 

The FixPred web interface is designed to explain the 
goals and principles of the MisPred and FixPred projects 
(web page ABOUT FIXPRED) and to allow the user to 
rapidly query the complete database (web page SEARCH 
FIXPRED) or to use the various FixPred tools to correct 
sequences identified as erroneous by MisPred (web page 
CORRECT YOUR SEQUENCE). 

Search tools 

FixPred provides three search options on the 'SEARCH 
FIXPRED' page: the simple, the advanced and the similar- 
ity search options. The Simple search option allows users 
to query any field of the database entries (protein ID in the 
source database, FixPred ID of the corrected sequence, 
protein description, database source, species name, type of
sequence error identified by MisPred). Under the
Advanced search option, users can combine queries of the 
different fields using the AND, OR and NOT operators. 
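
How field queries combine under the AND, OR and NOT operators can be illustrated with a small in-memory Python sketch. This is not the FixPred query engine: the field names, the helper functions and the second example entry are hypothetical and serve only to show the logic.

```python
from typing import Callable, Dict, List

Entry = Dict[str, str]
Predicate = Callable[[Entry], bool]


def contains(field: str, text: str) -> Predicate:
    """Case-insensitive substring match on one field of an entry."""
    return lambda entry: text.lower() in entry.get(field, "").lower()


def and_(a: Predicate, b: Predicate) -> Predicate:
    return lambda entry: a(entry) and b(entry)


def or_(a: Predicate, b: Predicate) -> Predicate:
    return lambda entry: a(entry) or b(entry)


def not_(a: Predicate) -> Predicate:
    return lambda entry: not a(entry)


def search(entries: List[Entry], query: Predicate) -> List[str]:
    """Return the IDs of all entries matching the combined query."""
    return [e["fixpred_id"] for e in entries if query(e)]


entries = [
    {"fixpred_id": "FXP0000000419", "species": "Gallus gallus",
     "error_type": "Domain size deviation"},
    # Made-up entry for illustration only.
    {"fixpred_id": "EXAMPLE-2", "species": "Homo sapiens",
     "error_type": "Domain size deviation"},
]
query = and_(contains("error_type", "domain size"),
             not_(contains("species", "homo")))
matches = search(entries, query)
```

Here the query selects entries affected by domain size deviation that do not originate from H. sapiens.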

The 'Find best match of your sequence in FixPred data- 
base by similarity' feature of SEARCH FIXPRED is meant 
to find the corrected version of an erroneous query se- 
quence. The idea behind this feature is to spare the time 
of sequence correction if the correct sequence is already 
deposited in the FixPred database. 

On initiating the search, the IDs of all protein sequences 
matching the criteria of the search are displayed. In the 
case of similarity searches, the alignments also may be dis- 
played. For each protein sequence retrieved (see Figure 2), 
a detailed result page is displayed (via a link from the protein
ID), providing basic information about the corrected pro- 
tein sequence, including FixPred ID, species name, amino 
acid sequence of the corrected protein, the type of evidence 
of the protein existence, the date of publication, as well as 
about the original protein sequences identified by MisPred 
as erroneous, including the protein ID, the protein descrip- 
tion, the database source, the species name, the erroneous 
protein sequences and the type of sequence error(s) identi- 
fied by MisPred. Links to the source databases are also 
provided to help the user retrieve supplementary informa- 
tion about the original protein. Selected sequences may be 
downloaded in a variety of formats (XML, EXCEL, 
FASTA, LIST). 
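
For readers who post-process downloaded entries, the following Python sketch shows how records could be written in FASTA format. The record layout (ID, description, sequence), the line width and the example record are assumptions for illustration; the exact header layout of FixPred downloads is not specified here.

```python
from typing import Iterable, Tuple


def to_fasta(records: Iterable[Tuple[str, str, str]], width: int = 60) -> str:
    """Format (ID, description, sequence) tuples as FASTA text.

    Each record gets a '>' header line followed by the sequence,
    wrapped to `width` residues per line.
    """
    if width < 1:
        raise ValueError("width must be positive")
    lines = []
    for seq_id, description, sequence in records:
        lines.append(f">{seq_id} {description}".rstrip())
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n" if lines else ""


# Invented record; the identifier and description are hypothetical.
fasta_text = to_fasta([("FIX_0001", "example corrected protein", "ACDEFGHIKL")], width=4)
print(fasta_text, end="")
```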

Sequence analysis tools 

Users can correct protein sequences identified as erroneous 
by MisPred on the 'CORRECT YOUR SEQUENCE' page 
using the FixPred pipeline. The results of the analysis are 
accessible in two different ways: without registration the 
results are available via a link for 72 h; registered users can 
access their results on the 'Recent Results' page for 
20 days. The result page is divided into two parts (see 
Figure 3A and B). The first section displays information 
about the erroneous protein sequence submitted for ana- 
lysis (automatically generated sequence ID, species 
name, protein sequence, task status, date and time of the 
completion of the MisPred analysis), the original MisPred 
annotations (presence or absence of signal peptide, trans- 
membrane helices, etc.) and the conclusion of the MisPred 
analysis, which lists the type(s) of sequence error(s) identified by 
the MisPred tools. The second section shows the same 
information about the corrected protein sequence. 
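
The relationship between the annotations and the reported conflict can be sketched as a simple rule, using the example shown in Figure 3. The sketch assumes that "appropriate sequence signals" means a signal peptide or a transmembrane helix, which is our reading of the three annotation rows (SP, EXT, TM) on the result page; the class and function names are hypothetical and this is not the MisPred implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotations:
    """Yes/no decisions as shown on the result page (SP, EXT, TM rows)."""
    signal_peptide: bool        # SP: SignalP detects signal peptide
    extracellular_domain: bool  # EXT: Pfam detects extracellular Pfam-A domains
    transmembrane_helix: bool   # TM: transmembrane helix detected


def has_conflict_1(annotations: Annotations) -> bool:
    """Extracellular domain present but no signal peptide or TM helix.

    Assumption: either signal counts as an 'appropriate sequence signal'.
    """
    has_signal = annotations.signal_peptide or annotations.transmembrane_helix
    return annotations.extracellular_domain and not has_signal


# Decisions taken from Figure 3A (original) and Figure 3B (corrected).
original = Annotations(signal_peptide=False, extracellular_domain=True,
                       transmembrane_helix=False)
corrected = Annotations(signal_peptide=True, extracellular_domain=True,
                        transmembrane_helix=False)

print(has_conflict_1(original), has_conflict_1(corrected))
```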

Conclusions and future perspectives 

In the future, the correction of erroneous protein sequences 
will be extended to sequences originating from other 
Metazoan species. We plan to update the sequence content 
of the FixPred database twice a year. 

[Figure 3: screenshot of a FixPred result page. (A) Protein sequence identified by MisPred as erroneous (species: Homo sapiens; finished: 2013-12-13 15:42:43). Original annotations: SP, SignalP detects signal peptide: NO; EXT, Pfam detects extracellular Pfam-A domains: YES; TM, Phobius detects transmembrane helix: NO. Original conflicts: Conflict 1, conflict between the presence of extracellular Pfam-A domains and the absence of appropriate sequence signals: YES. (B) Corrected protein sequence (species: Homo sapiens). Annotations: SP: YES; EXT: YES; TM: NO. Conflicts: Conflict 1: NO.] 

Figure 3. Correction of an erroneous protein sequence by the FixPred pipeline. (A) The upper part of the screen shot shows a H. sapiens protein sequence (NP_001184026.2, trypsin-3 isoform 3 preproprotein) that was identified as erroneous by MisPred tool 1 because it has an extracellular domain but lacks a secretory signal peptide. (B) The erroneous protein was corrected by the FixPred pipeline in Step 2 by identifying a version (NP_002762.2, trypsin-3 isoform 2 preproprotein) that does not suffer from this type of error (see lower part of the screen shot). 

Supplementary data 

Supplementary data are available at Database Online. 

Funding 

National Office for Research and Technology of Hungary 
(TECH_09_A1-FixPred9) and the Hungarian Scientific Research 
Fund (OTKA 101201). Funding for open access charge: Hungarian 
Scientific Research Fund (OTKA 101201). 

Conflict of interest. None declared. 



References 

1. Guigó,R., Flicek,P., Abril,J.F. et al. (2006) EGASP: the human 
ENCODE genome annotation assessment project. Genome 
Biol., 7 (Suppl. 1), S2. 

2. Harrow,J., Nagy,A., Reymond,A. et al. (2009) Identifying pro- 
tein-coding genes in genomic sequences. Genome Biol., 10, 201. 

3. Nagy,A., Hegyi,H., Farkas,K. et al. (2008) Identification and 
correction of abnormal, incomplete and mispredicted proteins in 
public databases. BMC Bioinformatics, 9, 353. 

4. Nagy,A. and Patthy,L. (2011) Reassessing domain architecture 
evolution of metazoan proteins: the contribution of different 
evolutionary mechanisms. Genes, 2, 578-598. 

5. Nagy,A., Szlama,G., Szarka,E. et al. (2011) Reassessing domain 
architecture evolution of metazoan proteins: major impact of 
gene prediction errors. Genes, 2, 449-501. 






6. Guo,B., Zou,M. and Wagner,A. (2012) Pervasive indels and 
their evolutionary dynamics after the fish-specific genome dupli- 
cation. Mol. Biol. Evol., 29, 3005-3022. 

7. Prosdocimi,F., Linard,B., Pontarotti,P. et al. (2012) 
Controversies in modern evolutionary biology: the imperative 
for error detection and quality control. BMC Genomics, 13, 5. 

8. Zhang,X., Goodsell,J. and Norgren,R.B. Jr. (2012) Limitations 
of the rhesus macaque draft genome assembly and annotation. 
BMC Genomics, 13, 206. 

9. Norgren,R.B. Jr. (2013) Improving genome assemblies and 
annotations for nonhuman primates. ILAR J., 54, 144-153. 

10. Nagy,A. and Patthy,L. (2013) MisPred: a resource for identifica- 
tion of erroneous protein sequences in public databases. 
Database (Oxford), 2013, bat053. 

11. Tress,M.L., Martelli,P.L., Frankish,A. et al. (2007) The implica- 
tions of alternative splicing in the ENCODE protein comple- 
ment. Proc. Natl Acad. Sci. USA, 104, 5495-5500. 

12. The UniProt Consortium. (2013) Update on activities at the 
Universal Protein Resource (UniProt) in 2013. Nucleic Acids 
Res., 41, D43-D47. 



13. Flicek,P., Ahmed,I., Amode,M.R. et al. (2013) Ensembl 2013. 
Nucleic Acids Res., 41, D48-D55. 

14. Pruitt,K.D., Tatusova,T., Brown,G.R. et al. (2013) NCBI 
Reference Sequences (RefSeq): current status, new features and 
genome annotation policy. Nucleic Acids Res., 40, D130-D135. 

15. Benson,D.A., Cavanaugh,M., Clark,K. et al. (2013) GenBank. 
Nucleic Acids Res., 41, D36-D42. 

16. Altschul,S.F., Madden,T.L., Schaffer,A.A. et al. (1997) Gapped 
BLAST and PSI-BLAST: a new generation of protein database 
search programs. Nucleic Acids Res., 25, 3389-3402. 

17. Rice,P., Longden,I. and Bleasby,A. (2000) EMBOSS: The 
European Molecular Biology Open Software Suite. Trends 
Genet., 16, 276-277. 

18. Birney,E., Clamp,M. and Durbin,R. (2004) GeneWise and 
Genomewise. Genome Res., 14, 988-995. 

19. Burge,C. and Karlin,S. (1997) Prediction of complete gene struc- 
tures in human genomic DNA. J. Mol. Biol., 268, 78-94. 

20. Stanke,M., Steinkamp,R., Waack,S. et al. (2004) AUGUSTUS: a 
web server for gene finding in eukaryotes. Nucleic Acids Res., 
32, W309-W312. 



